10 research outputs found
ECAPA-TDNN Embeddings for Speaker Diarization
Learning robust speaker embeddings is a crucial step in speaker diarization.
Deep neural networks can accurately capture speaker discriminative
characteristics and popular deep embeddings such as x-vectors are nowadays a
fundamental component of modern diarization systems. Recently, some
improvements over the standard TDNN architecture used for x-vectors have been
proposed. The ECAPA-TDNN model, for instance, has shown impressive performance
in the speaker verification domain, thanks to a carefully designed neural
model.
In this work, we extend, for the first time, the use of the ECAPA-TDNN model
to speaker diarization. Moreover, we improved its robustness with a powerful
augmentation scheme that concatenates several contaminated versions of the same
signal within the same training batch. The ECAPA-TDNN model turned out to
provide robust speaker embeddings under both close-talking and distant-talking
conditions. Our results on the popular AMI meeting corpus show that our system
significantly outperforms recently proposed approaches
Improved Cross-Lingual Transfer Learning For Automatic Speech Translation
Research in multilingual speech-to-text translation is topical. Having a
single model that supports multiple translation tasks is desirable. The goal of
this work it to improve cross-lingual transfer learning in multilingual
speech-to-text translation via semantic knowledge distillation. We show that by
initializing the encoder of the encoder-decoder sequence-to-sequence
translation model with SAMU-XLS-R, a multilingual speech transformer encoder
trained using multi-modal (speech-text) semantic knowledge distillation, we
achieve significantly better cross-lingual task knowledge transfer than the
baseline XLS-R, a multilingual speech transformer encoder trained via
self-supervised learning. We demonstrate the effectiveness of our approach on
two popular datasets, namely, CoVoST-2 and Europarl. On the 21 translation
tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU
points over the baselines. In the zero-shot translation scenario, we achieve an
average gain of 18.8 and 11.9 average BLEU points on unseen medium and
low-resource languages. We make similar observations on Europarl speech
translation benchmark
Incremental Transfer Learning in Two-pass Information Bottleneck Based Speaker Diarization System for Meetings
The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. TPIB system does not consider previously learned speaker discriminative information while di-arizing new conversations. Hence, the real time factor (RTF) of TPIB system is high owing to the training time required for the artificial neural network (ANN). This paper attempts to improve the RTF of the TPIB system using an incremental transfer learning approach where the parameters learned by the ANN from other conversations are updated using current conversation rather than learning parameters from scratch. This reduces the RTF significantly. The effectiveness of the proposed approach compared to the baseline IB and the TPIB systems is demonstrated on standard NIST and AMI conversational meeting datasets. With a minor degradation in performance, the proposed system shows a significant improvement of 33.07% and 24.45% in RTF with respect to TPIB system on the NIST RT-04Eval and AMI-1 datasets, respectively
ECAPA-TDNN embeddings for speaker diarization
Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed neural model.
In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improved its robustness with a powerful augmentation scheme that concatenates several contaminated versions of the same signal within the same training batch. The ECAPA-TDNN model turned out to provide robust speaker embeddings under both close-talking and distant-talking conditions. Our results on the popular AMI meeting corpus show that our system significantly outperforms recently proposed approaches
SpeechBrain: A General-Purpose Speech Toolkit
PreprintSpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies